# Pipelining

## A Pipelined Datapath

- Five stages, one step per stage
  - 1. IF: Instruction fetch from memory
  - 2. ID: Instruction decode & register read
  - 3. EX: Execute operation or calculate address
  - 4. MEM: Access memory operand
  - 5. WB: Write result back to register
- Need to add pipeline registers between stages

## 5 Steps of DLX Datapath

I-type instruction 6 5 5 16

Opcode rs rt Immediate

Encodes: Loads and stores of bytes, half words, words, double words. All immediates (rt -- rs op immediate)

Conditional branch instructions (rs is register, rd unused)
Jump register, jump and link register
(rd = 0, rs = destination, immediate = 0)

R-type instruction



Register-register ALU operations: rd + rs funct rt
Function encodes the data path operation: Add, Sub, . .
Read/write special registers and moves

J-type instruction



Jump and jump and link
Trap and return from exception



| Stage | Any instruction                                                                                                                                                        |                                                                                                            |                                                                   |  |  |  |
|-------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------|--|--|--|
| IF    | <pre>IF/ID.IR ← Mem[PC]; IF/ID.NPC,PC ← (if ((EX/MEM.opcode == branch) &amp; EX/MEM.cond){EX/MEM. ALUOutput} else {PC+4});</pre>                                       |                                                                                                            |                                                                   |  |  |  |
| ID    | <pre>ID/EX.A ← Regs[IF/ID.IR[rs]]; ID/EX.B ← Regs[IF/ID.IR[rt]]; ID/EX.NPC ← IF/ID.NPC; ID/EX.IR ← IF/ID.IR; ID/EX.Imm ← sign-extend(IF/ID.IR[immediate field]);</pre> |                                                                                                            |                                                                   |  |  |  |
|       | ALU instruction                                                                                                                                                        | Load or store instruction                                                                                  | Branch instruction                                                |  |  |  |
| EX    | EX/MEM.IR ← ID/EX.IR;<br>EX/MEM.ALUOutput ←<br>ID/EX.A func ID/EX.B;<br>or<br>EX/MEM.ALUOutput ←<br>ID/EX.A op ID/EX.Imm;                                              | EX/MEM.IR to ID/EX.IR<br>EX/MEM.ALUOutput ←<br>ID/EX.A + ID/EX.Imm;                                        | <pre>EX/MEM.ALUOutput ← ID/EX.NPC + (ID/EX.Imm &lt;&lt; 2);</pre> |  |  |  |
|       | ID/LX:A OP ID/LX:IIIIII,                                                                                                                                               | EX/MEM.B ← ID/EX.B;                                                                                        | <pre>EX/MEM.cond ← (ID/EX.A == 0);</pre>                          |  |  |  |
| MEM   | MEM/WB.IR ← EX/MEM.IR;<br>MEM/WB.ALUOutput ←<br>EX/MEM.ALUOutput;                                                                                                      | <pre>MEM/WB.IR ← EX/MEM.IR; MEM/WB.LMD ← Mem[EX/MEM.ALUOutput]; or Mem[EX/MEM.ALUOutput] ← EX/MEM.B;</pre> |                                                                   |  |  |  |
| WB    | Regs[MEM/WB.IR[rd]] ← MEM/WB.ALUOutput; or Regs[MEM/WB.IR[rt]] ← MEM/WB.ALUOutput;                                                                                     | For load only: Regs[MEM/WB.IR[rt]] ← MEM/WB.LMD;                                                           |                                                                   |  |  |  |

## Visualizing Pipelining





Read: second half of the cycle

Write: first half of the cycle



## Basic Performance issues in pipelining: An example

- Unpipelined processor:
  - Clock cycle time: 1 ns
  - □ ALU: 40%, uses 4 cycles
  - □ Branches: 20%: uses 4 cycles
  - Memory: 40%: uses 5 cycles
- Pipelined processor:
  - Clock cycle time: 1.2 ns
  - Ignoring any latency impact (Ideal CPI on a pipelined machine = 1)
- How much speedup?

#### Unpipelined:

- Average instruction exe time
  - = clock cycle time\* average CPI
  - = 1 ns \*(0.4\*4+0.2\*4+0.4\*5) = 4.4 ns

#### Pipelined:

- Average instruction exe time
  - = clock cycle time\* ideal CPI=1.2 ns \* 1 = 1.2 ns
- Speedup = 4.4/1.2 = 3.7 times

## Limits to pipelining

# Hazards prevent next instruction from executing during its designated clock cycle

- <u>Structural hazards</u>: HW cannot support this combination of instructions (single person to fold and put clothes away)
- <u>Data hazards</u>: Instruction depends on result of prior instruction still in the pipeline (missing sock)
- Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).

### Hazards => Stalls

#### Time (clock cycles)



# Speed Up Equation for Pipelining

CPI<sub>pipelined</sub> = Ideal CPI + Average Stall cycles per Inst

$$Speedup = \frac{CPI_{unpipelind}}{CPI_{pipelined}} \times \frac{Cycle \ Time_{unpipelind}}{Cycle \ Time_{pipelined}}$$

$$Speedup = \frac{Ideal \, CPI \times Pipeline \, depth}{Ideal \, CPI + Pipeline \, stall \, CPI} \times \frac{Cycle \, Time_{unpipeline}}{Cycle \, Time_{pipelined}}$$

Ideal CPI on a pipelined machine = 1

Speedup = 
$$\frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Cycle Time}_{\text{unpipelined}}}{\text{Cycle Time}_{\text{pipelined}}}$$

### Structural Hazard

- If some combination of instructions cannot be accommodated because of resource conflicts, the machine is said to have a structural hazard
- Common solution: stall the pipeline until no hazard in the pipeline
- A stall is commonly called a pipeline bubble

## One Memory Port/Structural Hazards



## One Memory Port/Structural Hazards

#### Time (clock cycles)



## Example

- Data reference: 40%
- Processor with structural hazard has a clock rate 1.05 times higher than the clock rate of the processor without the hazard
- Discarding any other performance losses
- Which processor is faster?

#### Discussion on Structural Hazard

- If all other factors are equal, a processor without structural hazards will have a lower CPI
- Why would a designer allow structural hazard?
  - To reduce cost of the unit
  - Eg. Supporting both an instruction cache and data cache every cycle requires twice the total memory bandwidth
  - Eg. Functional unit to deal with Floating point instruction

#### Data Hazard

Data hazards occur when the pipeline changes the order or read/write accesses to operands so that the order differs from the order seen sequentially executing instructions on an un-pipelined machine

#### Data Hazard on R1

#### Time (clock cycles)



## Three Generic Data Hazards

Read After Write (RAW)
Instr<sub>J</sub> tries to read operand before Instr<sub>I</sub> writes it

```
I: add r1,r2,r3
J: sub r4,r1,r3
```

 Caused by a "Dependence" (in compiler nomenclature). This hazard results from an actual need for communication.

### Three Generic Data Hazards

Write After Read (WAR)
 Instr<sub>J</sub> writes operand <u>before</u> Instr<sub>I</sub> reads it

```
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
```

- Called an "anti-dependence" by compiler writers.
   This results from reuse of the name "r1".
- Can't happen in DLX 5 stage pipeline because:
  - All instructions take 5 stages, and
  - Reads are always in stage 2, and
  - Writes are always in stage 5

#### Three Generic Data Hazards

Write After Write (WAW)
 Instr<sub>J</sub> writes operand <u>before</u> Instr<sub>I</sub> writes it.

```
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
```

- Called an "output dependence" by compiler writers
   This also results from the reuse of name "r1".
- Can't happen in DLX 5 stage pipeline because:
  - All instructions take 5 stages, and
  - Writes are always in stage 5
- Will see WAR and WAW in more complicated pipelines

# Minimizing Data Hazard Stalls by Forwarding

- Data hazard problem can be solved with a hardware technique called forwarding (also called bypassing)
- A result is forwarded from the output of one unit to the input of another
- Not all potential data hazards can be handled by forwarding

## Forwarding to Avoid Data Hazard

#### Time (clock cycles)





## Data Hazard Even with Forwarding

#### Time (clock cycles)





## Data Hazard Even with Forwarding

Time (clock cycles) I n 5 t r. sub r4,r1,r6 0 r d and r6,r1,r7

## Software Scheduling to Avoid Load Hazards

Try producing fast code for

$$a = b + c;$$
  
 $d = e - f;$ 

assuming a, b, c, d ,e, and f in memory.



# Implementing the Control for DLX Pipleline

 If a data hazard exists, the instruction is stalled

 Detecting interlocks early in the pipeline reduces the hardware complexity

 Detect the hazard or forwarding at the beginning of a clock cycle that uses an operand

# Load Interlock can be detected by detection hardware

| Situation                               | Example code<br>sequence  LD R1,45(R2) DADD R5,R6,R7 DSUB R8,R6,R7 OR R9,R6,R7 |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                | No hazard possible because no dependence exists on R1 in the immediately following three instructions.                                                    |  |  |
|-----------------------------------------|--------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|
| No dependence                           |                                                                                |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |                                                                                                                                                           |  |  |
| Dependence requiring stall              | LD<br>DADD<br>DSUB<br>OR                                                       | The state of the s | Comparators detect the use of R1 in the DADD and stall the DADD (and DSUB and 0R) before the DADD begins EX.                                              |  |  |
| Dependence<br>overcome by<br>forwarding | LD<br>DADD<br>DSUB<br>OR                                                       | R1,45(R2)<br>R5,R6,R7<br>R8,R1,R7<br>R9,R6,R7                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  | Comparators detect use of R1 in DSUB and forward result of load to ALU in time for DSUB to begin EX.                                                      |  |  |
| Dependence with accesses in order       | LD<br>DADD<br>DSUB<br>OR                                                       | R1,45(R2)<br>R5,R6,R7<br>R8,R6,R7<br>R9,R1,R7                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  | No action required because the read of R1 by 0R occurs in the second half of the ID phase, while the write of the loaded data occurred in the first half. |  |  |

## Logic to detect the Load Interlocks

| Previous instruction                            |                                                 | Reg-reg ALU has two source reg  |  |  |
|-------------------------------------------------|-------------------------------------------------|---------------------------------|--|--|
| Opcode field of ID/EX (ID/EX.IR <sub>05</sub> ) | Opcode field of IF/ID (IF/ID.IR <sub>05</sub> ) | Matching operand fields         |  |  |
| Load                                            | Register-register ALU                           | ID/EX.IR[rt] == IF/ID.IR[rs]    |  |  |
| Load                                            | Register-register ALU                           | ID/EX.IR[rt] == IF/ID.IR[rt]    |  |  |
| Load                                            | Load, store, ALU immediate or branch            | e, ID/EX.IR[rt] == IF/ID.IR[rs] |  |  |

Reg-reg ALU (R type)

| ор     | rs     | rt     | rd     | shamt  | funct  |
|--------|--------|--------|--------|--------|--------|
| 6 bits | 5 bits | 5 bits | 5 bits | 5 bits | 6 bits |

Load, store, ALU immediate, branch (I type)

| ор     | rs     | rt     | constant or address |
|--------|--------|--------|---------------------|
| 6 bits | 5 bits | 5 bits | 16 bits             |

# Forwarding Conditions 1/2

| Pipeline register<br>containing source<br>instruction | Opcode<br>of source<br>instruction | Pipeline register containing destination instruction | Opcode of destination instruction                               | Destination<br>of the<br>forwarded<br>result | Comparison (if equal then forward) |
|-------------------------------------------------------|------------------------------------|------------------------------------------------------|-----------------------------------------------------------------|----------------------------------------------|------------------------------------|
| EX/MEM                                                | Register-<br>register ALU          | ID/EX                                                | Register-register ALU,<br>ALU immediate, load,<br>store, branch | Top ALU<br>input                             | EX/MEM.IR[rd] ==<br>ID/EX.IR[rs]   |
| EX/MEM                                                | Register-<br>register ALU          | ID/EX                                                | Register-register ALU                                           | Bottom ALU input                             | EX/MEM.IR[rd] ==<br>ID/EX.IR[rt]   |
| MEM/WB                                                | Register-<br>register ALU          | ID/EX                                                | Register-register ALU,<br>ALU immediate, load,<br>store, branch | Top ALU<br>input                             | MEM/WB.IR[rd] ==<br>ID/EX.IR[rs]   |
| MEM/WB                                                | Register-<br>register ALU          | ID/EX                                                | Register-register ALU                                           | Bottom ALU input                             | MEM/WB.IR[rd] ==<br>ID/EX.IR[rt]   |



## Forwarding Conditions 2/2





| Pipeline register containing source instruction | Opcode<br>of source<br>instruction | Pipeline<br>register<br>containing<br>destination<br>instruction | Opcode of destination instruction                               | Destination<br>of the<br>forwarded<br>result | Comparison (if equal then forward) |
|-------------------------------------------------|------------------------------------|------------------------------------------------------------------|-----------------------------------------------------------------|----------------------------------------------|------------------------------------|
| EX/MEM                                          | ALU<br>immediate                   | ID/EX                                                            | Register-register ALU,<br>ALU immediate, load,<br>store, branch | Top ALU input                                | EX/MEM.IR[rt] ==<br>ID/EX.IR[rs]   |
| EX/MEM                                          | ALU immediate                      | ID/EX                                                            | Register-register ALU                                           | Bottom ALU<br>input                          | EX/MEM.IR[rt] ==<br>ID/EX.IR[rt]   |
| MEM/WB                                          | ALU<br>immediate                   | ID/EX                                                            | Register-register ALU,<br>ALU immediate, load,<br>store, branch | Top ALU input                                | MEM/WB.IR[rt] ==<br>ID/EX.IR[rs]   |
| MEM/WB                                          | ALU<br>immediate                   | ID/EX                                                            | Register-register ALU                                           | Bottom ALU input                             | MEM/WB.IR[rt] ==<br>ID/EX.IR[rt]   |
| MEM/WB                                          | Load                               | ID/EX                                                            | Register-register ALU,<br>ALU immediate, load,<br>store, branch | Top ALU<br>input                             | MEM/WB.IR[rt] ==<br>ID/EX.IR[rs]   |
| MEM/WB                                          | Load                               | ID/EX                                                            | Register-register ALU                                           | Bottom ALU input                             | MEM/WB.IR[rt] ==<br>ID/EX.IR[rt]   |

## HW Change for Forwarding



## Control Hazards

- Control hazards can cause a greater performance loss for the DLX pipeline than do data hazards
- When a branch is executed, it may or may not change the PC
  - Taken
  - Untaken

#### Control Hazard on Branches

Three Stage Stall

10: beq r1,r3,36

14: and r2,r3,r5

18: or r6,r1,r7

22: add r8, r1, r9

36: xor r10, r1, r11

What do you do with the 3 instructions in between?

How do you do it?

Where is the "commit"?



### Branch Stall Impact

If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9

```
Op Freq Cycles
Other 70% 1
Branch 30% 4
0.7*1+0.3*4=0.7+1.2=1.9
```

- Two part solution:
  - Determine branch taken or not sooner, AND
  - Compute taken branch address earlier
- DLX branch tests if register = 0 or ≠ 0
- DLX Solution:
  - Move Zero test to ID stage
  - Adder to calculate new PC in ID stage
  - 1 clock cycle penalty for branch versus 3

### Revised pipeline structure



#### Four Branch Hazard Alternatives

- #1: Stall until branch direction is clear
- #2: Predict Branch Not Taken
  - Execute successor instructions in sequence
  - "Squash" instructions in pipeline if branch actually taken
  - Advantage of late pipeline state update
  - 47% DLX branches not taken on average
  - PC+4 already calculated, so use it to get next instruction

#### #3: Predict Branch Taken

- 53% DLX branches taken on average
- But haven't calculated branch target address in DLX
  - DLX still incurs 1 cycle branch penalty
  - Other machines: branch target known before outcome

#### Four Branch Hazard Alternatives

#### #4: Delayed Branch

Define branch to take place AFTER a following instruction

```
\begin{array}{c} \text{branch instruction} \\ \text{sequential successor}_1 \\ \text{sequential successor}_2 \\ \text{sequential successor}_n \\ \end{array}
```

- 1 slot delay allows proper decision and branch target address in 5 stage pipeline
- DLX uses this

### Delayed Branch



# Delayed-branch scheduling schemes and their requirements

| Scheduling Strategy | Requirements                                                           | Improves performance when |
|---------------------|------------------------------------------------------------------------|---------------------------|
| From Before         | Branch must not depend on the rescheduled instruction                  | always                    |
| From target         | Must be OK to execute rescheduled instructions if branch is not taken. | When branch is taken      |
| From fall through   | Must be OK to execute instructions if branch is taken                  | When branch is not taken  |

#### Delayed Branch

- Where to get instructions to fill branch delay slot?
  - Before branch instruction
  - From the target address: only valuable when branch taken
  - From fall through: only valuable when branch not taken
  - Canceling branches allow more slots to be filled
- Compiler effectiveness for single branch delay slot:
  - Fills about 60% of branch delay slots
  - About 80% of instructions executed in branch delay slots useful in computation
  - About 50% (60% x 80%) of slots usefully filled
- Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar): cannot easily hide the long delays
  - More powerful prediction scheme is desired for deep pipelines

## Evaluating Branch Alternatives

Pipeline speedup = 
$$\frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$$

| Scheduling        | Branch  | CPI  | speedup v.  | speedup v. |
|-------------------|---------|------|-------------|------------|
| scheme            | penalty |      | unpipelined | stall      |
| Stall pipeline    | 3       | 1.42 | 3.5         | 1.0        |
| Predict taken     | 1       | 1.14 | 4.4         | 1.26       |
| Predict not taker | n 1     | 1.09 | 4.5         | 1.29       |
| Delayed branch    | 0.5     | 1.07 | 4.6         | 1.31       |

# What makes pipelining hard to implement

- Dealing with exceptions
- Stopping and restarting execution
- Instruction set complication

### Exceptions of DLX pipeline

| Pipeline stage | Problem exceptions occurring                                                            |
|----------------|-----------------------------------------------------------------------------------------|
| IF             | Page fault on instruction fetch; misaligned memory access; memory- protection violation |
| ID             | Undefined or illegal op-code                                                            |
| EX             | Arithmetic Exception                                                                    |
| MEM            | Page Fault on data fetch; misaligned memory access; memory- protection violation        |
| WB             | None                                                                                    |

# Extending the DLX pipeline to handle multiple operations



### Latencies and initiation intervals for

#### function units

# of intervening cycles between an instruction that produces a result and an instruction that uses the result.

| Functional unit | Latency       | Initial interval |                                                 |  |  |  |
|-----------------|---------------|------------------|-------------------------------------------------|--|--|--|
|                 | (clock cycle) | (Repeat int      | er य)                                           |  |  |  |
| Integer ALU     | 0             | 1                | # of clock cycles                               |  |  |  |
| Data memory     | 1             | 1                | that must elapse between issuing two operations |  |  |  |
| FP add          | 3             | 1                |                                                 |  |  |  |
| FP multiply     | 6             | 1                | of a given type                                 |  |  |  |
| FP divide       | 24            | 25               |                                                 |  |  |  |

 As we can observe form the table, EX stage needs not only one cycle

# A pipeline that supports multiple outstanding FP operations



# A pipeline that supports multiple outstanding FP operations

- The divide unit is not fully pipelined. Structural hazards can occur.
- The instructions have varying running times. The number of register writes required in a cycle can be larger than 1.
- Instructions can complete in a different order than they were issued.
- Stalls for RAW hazards are more frequent.

## Multiple Write Back Simultaneously

|                |    | Clock cycle number |    |    |     |            |    |     |     |     |    |  |  |
|----------------|----|--------------------|----|----|-----|------------|----|-----|-----|-----|----|--|--|
| Instruction    | 1  | 2                  | 3  | 4  | 5   | 6          | 7  | 8   | 9   | 10  | 11 |  |  |
| MUL.D F0,F4,F6 | IF | ID                 | M1 | M2 | M3  | M4         | M5 | M6  | M7  | MEM | WB |  |  |
| •••            |    | IF                 | ID | EX | MEM | WB         |    |     |     |     |    |  |  |
| •••            |    |                    | IF | ID | EX  | MEM        | WB |     |     |     |    |  |  |
| ADD.D F2,F4,F6 |    |                    |    | IF | ID  | <b>A</b> 1 | A2 | A3  | A4  | MEM | WB |  |  |
| •••            |    |                    |    |    | IF  | ID         | EX | MEM | WB  |     |    |  |  |
| •••            |    |                    |    |    |     | IF         | ID | EX  | MEM | WB  |    |  |  |
| L.D F2,0(R2)   |    |                    |    |    |     |            | IF | ID  | EX  | MEM | WB |  |  |

#### RAW hazards

- The longer pipeline raises the frequency of stalls.
- In this example, each instruction is dependent on the previous one.

|        |          | Clock cycle number |    |    |       |    |       |       |       |       |       |       |     |    |       |       |       |     |
|--------|----------|--------------------|----|----|-------|----|-------|-------|-------|-------|-------|-------|-----|----|-------|-------|-------|-----|
| Instru | ction    | 1                  | 2  | 3  | 4     | 5  | 6     | 7     | 8     | 9     | 10    | 11    | 12  | 13 | 14    | 15    | 16    | 17  |
| L.D    | F4,0(R2) | IF                 | ID | EX | MEM   | WB |       |       |       |       |       |       |     |    |       |       |       |     |
| MUL.D  | F0,F4,F6 |                    | IF | ID | stall | M1 | M2    | M3    | M4    | M5    | M6    | M7    | MEM | WB |       |       |       |     |
| ADD.D  | F2,F0,F8 |                    |    | IF | stall | ID | stall | stall | stall | stall | stall | stall | A1  | A2 | A3    | A4    | MEM   | WB  |
| S.D    | F2,0(R2) |                    |    |    |       | IF | stall | stall | stall | stall | stall | stall | ID  | EX | stall | stall | stall | MEM |

### The MIPS R4000 Pipeline

- The eight pipeline stages of MIPS R4000
  - □ IF First half of instruction fetch
  - IS Second half of instruction fetch
  - RF Instruction decode and register fetch
  - □ EX Execution
  - DF First half of data cache access
  - DS Second half of data cache access
  - □ TC Tag check (check cache hit or miss)
  - □ WB Write back

## The eight stage pipeline structure of MIPS R4000



© 2003 Elsevier Science (USA). All rights reserved.

# A two-cycle load delay of R4000 integer pipeline



#### The basic branch delay of R4000 pipeline



© 2003 Elsevier Science (USA). All rights reserved.

### Summary: Longer latency pipelines

- The divide unit is not fully pipelined, structural hazards can occur.
- Varying running time
  - Number of register writes required in one cycle can be > 1
  - WAW hazards are possible (not reach WB in order)
  - out of order completion causing problems with exception
- Stalls for RAW will be more frequent
- WAR is not possible in this case since the register reads always occurs in ID